Preliminary Analysis of a Slavic Parallel Corpus

Author

  • Emmerich Kelih
Abstract

This paper gives a detailed description of a newly developed parallel corpus of Slavic languages. It consists of 11 Slavic translations of the well-known Russian socialist realist novel “Kak zakaljalas’ stal’/How the steel was tempered” (KZS), written by N.A. Ostrovskij in the years 1932-34. The KZS corpus contains the Slovene, Croatian, Serbian (ekavian), Macedonian, Bulgarian, Ukrainian, Belorussian, Slovak, Czech, Polish and Upper Sorbian translations. Thus, for the first time, a parallel text in almost all Slavic standard languages is available. In addition to a discussion of some text-specific issues of KZS, an exploratory statistical analysis and a linguistic interpretation of text length and the type-token ratio are offered.
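The two quantities named in the abstract, text length and the type-token ratio, can be illustrated with a minimal Python sketch. It assumes that text length is counted in word tokens and that the ratio is the number of distinct word forms divided by the number of tokens. The tokenization rule, the function names and the sample sentence are illustrative assumptions, not the procedure used in the paper.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Assumption: a token is any run of Unicode word characters; the paper
    may define word forms differently.
    """
    return re.findall(r"\w+", text.lower())


def type_token_ratio(text):
    """Return the number of distinct tokens (types) divided by the number of tokens."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0  # avoid division by zero for empty input
    return len(set(tokens)) / len(tokens)


# Hypothetical sample sentence, not taken from the corpus.
sample = "the steel was tempered and the steel was hard"
tokens = tokenize(sample)
print("text length (tokens):", len(tokens))
print("types:", len(set(tokens)))
print("type-token ratio:", round(type_token_ratio(sample), 3))
```

Because the ratio tends to fall as a text grows longer, values computed this way are easiest to compare across texts of similar length, such as translations of the same novel.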

Similar resources

The type-token relationship in Slavic parallel texts

The aim of the paper is to analyse the statistical regulation of the type-token relationship in Slavic parallel texts. Furthermore, it is shown that this relationship in parallel texts can be explained by morphological and typological characteristics. Keywords: type-token relationship, Slavic languages, corpus, parallel texts

Building Language Resources and Translation Models for Machine Translation Focused on South Slavic and Balkan Languages

The aim of this short-term project was to investigate the feasibility of machine translation (MT) research and development for several South Slavic and Balkan languages, more precisely Romanian, Bulgarian, Slovene, Greek and Serbian. For these languages, MT systems are scarce and for some of them even non-existent. We provide a brief description of the project’s major research tasks: Compilatio...

Comparing k-means clusters on parallel Persian-English corpus

This paper compares clusters of aligned Persian and English texts obtained with the k-means method. Text clustering has many applications in various fields of natural language processing. So far, much research has been done on clustering English documents. The question now arises: can these results be extended to other languages? Since the goal of document clustering is grouping of docum...

Multi-source morphosyntactic tagging for spoken Rusyn

This paper deals with the development of morphosyntactic taggers for spoken varieties of the Slavic minority language Rusyn. As neither annotated corpora nor parallel corpora are electronically available for Rusyn, we propose to combine existing resources from the etymologically close Slavic languages Russian, Ukrainian, Slovak, and Polish and adapt them to Rusyn. Using MarMoT as tagging toolki...

Are Web Corpora Inferior? The Case of Czech and Slovak

Our paper describes an experiment aimed at assessing lexical coverage in web corpora in comparison with traditional ones for two closely related Slavic languages from the lexicographers’ perspective. The preliminary results show that web corpora should not be considered “inferior”, but rather “different”.

Publication date: 2009